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ABSTRACT 


'  /*•  t!  - 

In  this  paper  we  present  a  parallel  procedure  for  the  solution  of  first-order 
linear  recurrence  systems  of  size  N  when  the  number  of  processors  p  is  small  in 
relation  to  N.  We  show  that  when  1  <  p2  <  N,  a  first-order  linear  recurrence 
system  of  size  N  can  be  solved  in  5(iV— l)/(p  +1)  steps  on  a  p  processor  SIMD 
machine  and  at  most  5(Ar  —  — )/(p  +  — )  steps  on  a  p  processor  MIMD 

machine.  As  a  special  case,  we  further  show  that  our  approach  precisely 
achieves  the  lower  bound  2(iV— l)/(p+l)  for  solving  the  parallel  prefix  problem 
on  a  p  processor  machine. 
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I.  INTRODUCTION 

In  this  paper  we  present  a  parallel  algorithm  to  solve  the  well-known  first- 
order  linear  recurrence  system  R<N,1>  when  the  number  of  processors  p  is 
small  in  relation  to  N,  and  where  R<N,1>  is  defined  as  follows: 

R<N,1>:  Given  N,  given  b  —  (blt  b2,  ■■■  ,  bN),  and  given  a  =  (a2,  c3,..., 
ax),  compute  x  =  (x1?  x2,  ■  ■  ■  ,  X/y)  such  that  xx  =  bx  and  x,-  =  a,x,_i  +  6,  for 
I  =  2,3  ,...,iV. 

We  present  both  SIMD  and  MIMD  versions  of  the  algorithm.  We  analyze 
the  SIMD  version  by  first  considering  a  simplified  shared  memory  model  of 
parallel  computation  that  facilitates  comparison  with  previous  work.  In  that 
model  the  parallelism  exhibited  by  the  proposed  algorithm  is  examined  in  terms 
of  data  dependencies  only,  therefore  allowing  us  to  determine  the  idealized  per¬ 
formance  of  the  procedure.  We  then  consider  a  second  model  of  computation 
which  consists  of  a  SIMD  p  processor  ring  configuration  with  a  broadcasting 
capability.  In  that  model  interprocessor  communication  is  taken  into  account, 
and  a  more  realistic  analysis  of  the  algorithm  is  performed.  The  MIMD  version 
of  the  algorithm  is  analyzed  by  considering  the  same  simplified  model  as  in  the 
SIMD  case,  with  the  exception  that  the  same  operation  need  not  be  performed 
by  all  processors  at  the  same  time.  Finally,  we  observe  that  the  algorithm  can 
be  efficiently  mapped  to  an  MIMD  p  processor  ring  configuration  with  a  broad¬ 
casting  capability. 

Many  algorithms  have  been  proposed  to  solve  linear  recurrences  in  parallel, 
each  with  different  objectives.  Earlier  results  assumed  the  availability  of  an 
unlimited  number  of  processing  elements  and  were  concerned  with  determining 
the  number  of  processors  necessary  to  achieve  minimal  computational  time 
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[KOG73],  [CHE75],  [KUC76],  [SAM77].  Later,  limited  processor  solutions  were 

considered.  Chen,  Kuck  and  Sameh  [CHE78]  presented  a  SIMD  algorithm  that 

solved  M-th  order  systems  in  (“)(2M2+3M)  +  0(A/2log2  -^7)  steps,  but 

p  M 

did  not  discuss  any  specific  parallel  implementation.  Gajski  [GAJ81]  improved 
upon  this  result  by  performing  the  SIMD  computation  in  less  than 
(— )(2M2+3M)  steps  using  p  <  N *  processors  in  a  shared  memory  architec- 

p 

ture.  In  this  paper  we  show  that  by  using  a  SIMD  p  processor  ring  network 
modified  to  support  global  broadcasting,  the  number  of  steps  required  to  solve  a 
first-order  linear  recurrence  of  size  N  >  p2  is  5(iV— l)/(p+l).  This  improves 
upon  the  results  of  [CHE78]  and  [GAJ81]  for  the  first-order  case  when  N  >  p2. 
Our  approach  is  a  generalization  of  the  matrix  factorization  technique 
presented  in  [MEY85],  and  it  reduces  to  the  SIMD  procedure  presented  in 
[GAJ81]  when  N  =  p2. 

Moreover,  when  a,  —  1,  for  all  *  in  [l,2,...,iV],  R<N,  1>  reduces  to  a  par¬ 
ticular  form  of  the  parallel  prefix  problem.  When  N  >  p,  Kruskal,  Rudolph 
and  Snir  [KRU85]  present  an  algorithm  which  solves  the  parallel  prefix  problem 
in  2 N /p  +  2 log2p  -  2  steps.  Snir  [SNI86]  improves  upon  this  approach  when 
N  >  p2  by  solving  the  problem  in  2Ar/(p+l)  -I-  0(1)  steps,  which  is  within  a 
constant  additive  term  of  the  lower  bound.  In  case  of  parallel  prefix,  we  show 
that  our  algorithm  precisely  achieves  the  lower  bound  2(N— l)/(p+l)  esta¬ 
blished  by  Snir  [SNI86]  when  N  >  p2. 

Carlson  [CAR84]  considered  mapping  the  computation  of  first-order  linear 
recurrence  systems  to  perfect  shuffle  and  cube-connected  systems.  A  common 
feature  of  both  Carlson’s  algorithm  and  our  algorithm  is  that  the  computation 
is  organized  so  that  the  transfer  of  input  and  output  data  can  be  performed 


concurrently  with  the  execution  of  the  algorithm,  thus  providing  a  balance 
between  I/O  and  processing  loads. 

An  important  problem  related  to  the  parallel  solution  of  R<iV,l>  is  the 
parallel  evaluation  of  general  arithmetic  expressions  [BRE74].  In  the  case  of 
first-order  linear  recurrences  of  size  N,  the  problem  is  to  efficiently  evaluate  x ^ 
in  parallel  using  p  processors.  In  that  case,  we  observe  that  when  applied  to 
computing  xN  only,  a  straightforward  application  of  our  approach  does  not 
improve  upon  existing  results  [BRE74],  [CHE78]. 

In  order  to  specify  the  parallelism  exhibited  by  our  algorithms,  we  augment 
those  statements  which  can  be  executed  in  parallel.  We  use  the  following  syn¬ 
tax: 

FORALL  i  e  S  DO  IN  PARALLEL 

BODY  /*  Comments  */ 

END  FORALL 

which  indicates  that  the  BODY  may  be  executed  concurrently  for  each  i  in  the 
set  S. 

The  SIMD  procedure  to  solve  R  <iV,l>  is  presented  in  Section  II;  in  Sec¬ 
tion  III  we  discuss  a  parallel  implementation  of  the  algorithm,  and  in  Section  IV 
we  present  an  MIMD  version  of  our  algorithm  to  solve  R<N,  1>.  Finally,  our 
conclusions  comprise  Section  V. 

H.  THE  SIMD  ALGORITHM 

The  abstract  model  of  SIMD  parallel  computation  (Figure  1)  considered  in 
this  section  consists  of  a  global  parallel  memory,  p  parallel  processors,  and  an 
interconnection  network,  where  all  processors  perform  the  same  operation  at 
each  time  step.  We  further  simplify  the  model  by  making  the  following  assump- 
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tions: 


s 


Al.  Each  arithmetic  operation  (addition  or  multiplication)  is  performed  in  unit 
time,  referred  to  as  a  step. 

A2.  There  are  no  accessing  conflicts  in  global  memory;  furthermore,  all  data  is 
assumed  to  initially  reside  in  global  memory. 

A3.  There  is  no  time  required  to  access  global  memory. 

This  simplistic  model  allows  the  parallelism  of  the  proposed  algorithm  to  be 
analyzed  without  introducing  the  added  complexity  of  the  implementation.  In 
Section  III  we  map  the  algorithm  to  a  specific  computational  network  and  we 
analyze  the  corresponding  implementation. 


Given  N  and  p,  our  approach  to  solving  R<N,  1>  consists  of  partitioning 

N- 1 


R<N,l>  into  a  sequence  of 


p2-l 


reduced  recurrence  systems  R<n,  1>, 


each  of  size  n  =  p2,  except  for  the  last  recurrence  system  which  may  be  of  size 
less  than  p2.  Each  R<n,  1>  is  then  solved  in  parallel  with  its  initial  value 
taken  as  the  final  value  obtained  from  the  previously  solved  reduced  recurrence 
system,  except  for  the  first  recurrence  that  uses  xl  =  bx.  The  first  R<n,  1>  is 
solved  as  follows:  (i)  each  of  the  p  processors  concurrently  computes  a  partial 
solution  for  a  different  x,-;  (ii)  after  p  parallel  iterations  p2  partial  solutions 
have  been  computed,  one  for  each  x,,  i  in  [1,2, ...,p2],  where  the  partial  solutions 
for  x,,  t  in  [l,2,...,p],  are  precisely  the  solutions  for  R<N,  1>;  (iii)  based  upon 
xp  the  next  p  partial  solutions  x,-,  i  in  [p+l,p+2,...,2p]  are  then  updated  in 
parallel  to  their  correct  values.  After  p—  1  parallel  update  iterations  R<n,  1> 
is  solved.  The  next  reduced  recurrence  system  of  size  p2  is  then  solved  with  xp2 
as  its  initial  value.  We  continue  in  this  manner  until  the  last  R<n,  1>  is 
solved.  Since  the  initial  and  final  values  of  each  f?<n,l>  overlap,  the  complete 


solution  of  R<N,  1>  requires  solving  — - -  reduced  recurrence  systems. 

|p--i  | 

N—l 

We  now  describe  the  SIMD  algorithm  to  solve  i?<iV,l>  when  — - - is 

P2- 1 

an  integer.  For  win  [0,1 AT — l)/(p2— 1)],  let  the  index  sets  and  be 
defined  as 

Su  —  {l  +  mp  +  w(p2— 1)  :  m  =  l,2,...,p— l} 

and 

=  {l  +  mp  +  w(p2— 1)  :  m  =  0,l,...,p  — l}. 

1.  PROCEDURE  R(N,p,a,b) 

2.  Xj  :=  bY 

3.  FOR  w  :=  0  TO  (iV— l)/(p2— 1)  —  1  DO  /*  Solve  each  R<n,l>  */ 

4.  FORALL  «  t  Su  DO  IN  PARALLEL  /*  Begin  Coefficient  Computation  Phase  */ 

5.  A  [*,»)  :=  a,-; 

6.  END  FORALL 

7.  FOR  :=  1  TO  (p-1)  DO 

8.  FORALL  j  «  Sw  DO  IN  PARALLEL 

9-  A[i+j,j\  :=  ai+j  A[i+;-l,;']; 

10.  END  FORALL 

11.  END  FOR  /*  End  Coefficient  Computation  Phase  */ 

12.  FORALL  t  (  Su  DO  IN  PARALLEL  /*  Begin  Partial  Solution  Phase  */ 

13.  *,  6,; 

14.  END  FORALL 

15.  FOR  i  :=  1  TO  (p-1)  DO 

16.  FORALL  j  c  Tu  DO  IN  PARALLEL 

xi+i  ai+j  xi+i-l  +  bj+j\ 


17. 


18. 


END  FORALL 


/*  End  Partial  Solution  Phase  * / 
/*  Begin  Solution  Update  Phase  */ 


19.  END  FOR 

20.  FOR  «  f  Su  DO 

21.  FORALL  j  :=  0  TO  (p— 1)  DO  IN  PARALLEL 

22.  xi+)  :=  A  [»+i,»]  +  xi+J; 

23.  END  FORALL 

24.  END  FOR  /*  End  Solution  Update  Phase  */ 

25.  END  FOR 

26.  END  PROCEDURE 

N— l 

The  preceding  algorithm  sequentially  solves  — - - reduced  recurrence 

p2-l 

systems  of  size  p2,  each  in  parallel.  Each  reduced  system  is  solved  in  three 
phases:  the  coefficient  computation  phase  consists  of  the  execution  of  loops  4 
and  7  and  computes  all  coefficients  of  the  form  ai+jai+j_x  •  ■  ■  ay  which  are 
needed  later  during  the  solution  updates;  the  partial  solution  phase  consists  of 
the  execution  of  loops  12  and  15  and  computes  p2  partial  solutions,  in  which  the 
first  p  partial  solutions  are  the  actual  solutions;  and  finally,  the  solution  update 
phase  consists  of  the  execution  of  loop  20  in  which  the  coefficients  computed  in 

the  first  phase  are  used  to  update  the  next  p  estimates  at  each  iteration.  The 

N _ i 

complete  solution  to  R<N,1>  is  therefore  obtained  after  executing  — ;; - 

P  - 1 

iterations  of  loop  3.  An  example  illustrating  the  computations  performed  by  the 
algorithm  is  given  in  Figure  2  for  the  case  N  =  17  and  p  =  3  where  the  nota¬ 
tion  Xi  is  used  to  indicate  a  correct  value  for  the  solution.  Note  that  each  com¬ 
putational  level  may  be  performed  in  parallel  using  at  most  three  processors. 

Our  model  assumptions  imply  that  we  need  only  consider  computational 
operations  when  determining  the  number  of  steps  required  by  the  algorithm. 


Therefore  we  must  examine  those  computations  performed  in  loops  7,  15  and  20. 
The  execution  of  loop  7  is  performed  p  —  1  times,  each  iteration  requiring  p—  1 
processors  to  concurrently  perform  a  single  multiplication,  thus  loop  7  requires 
p— 1  steps.  Both  loops  15  and  20  are  iterated  p—  1  times,  each  iteration  requir¬ 
ing  p  processors  to  concurrently  perform  a  multiplication  followed  by  an  addi¬ 
tion.  Thus,  loops  15  and  20  each  require  2(p— 1)  steps.  The  total  number  of 
steps  required  to  solve  each  reduced  recurrence  system  i?<p2,l>  using  p  pro¬ 
cessors  is  therefore  5(p— 1),  and  hence  the  resulting  theorem  follows. 

N—l 

Theorem  1:  Given  N  and  p  such  that  — - - is  an  integer,  the  number  of 

p*-l 


steps  required  to  solve  the  linear  recurrence  system  R<N,l>  using  a  SIMD 
parallel  computer  with  p  processors  is  — 

p+1 

N—l 

If  — - - is  not  an  integer,  our  approach  to  solving  R<N,1>  requires 

P  — 1 

that  we  solve  the  reduced  recurrence  system  R<nr,  1>,  where  nT  <  p2.  One 
approach  to  solving  R<nr,  1>  is  to  use  a  technique  which  is  known  to  be 
efficient  whenever  nT  <  p2.  Applicable  techniques  include  the  algorithms 
presented  by  Chen,  et  al.  [CHE78]  and  Kogge  and  Stone  [KOG73];  however, 
these  techniques  are  not  desirable  because  they  require  the  machine  to  store 
and  execute  multiple  algorithms  based  upon  the  size  of  the  recurrence  system. 

A  less  efficient  but  more  practical  approach  to  solving  i?<nr,l>  consists  of 
using  the  proposed  technique  to  solve  the  augmented  system  i?<p2,l>  and  sim¬ 
ply  terminate  the  computation  when  the  last  x,-  is  computed.^  In  that  case  the 

N- 1 


number  of  steps  required  by  the  algorithm  is  at  most 


P2- 1 


5(P-1). 


Finally,  we  make  the  observation  that  the  above  SIMD  algorithm  most  not¬ 
ably  differs  from  the  approaches  presented  in  [CHE78]  and  [GAJ81]  in  that  our 
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approach  partitions  the  problem  and  sequentially  solves  a  series  of  reduced 
recurrences  of  size  p2.  However,  when  TV  =  p2,  our  approach  reduces  to  that  of 
[GAJ81],  except  that  Gajski  presents  the  coefficient  computation  and  partial 
solution  phases  as  a  single  computational  phase.  Moreover,  when  TV  >  p2,  the 
algorithm  of  Chen,  et  al.  [CHE78]  is  less  efficient  than  Gajski’s  as  a  result  of 
implementing  an  extra  computational  phase  in  which  a  separate  first-order 
recurrence  of  size  p  is  solved  using  p  processors,  requiring  an  additional  2 log0p 
steps. 

When  TV  and  p  are  powers  of  two,  the  algorithm  of  Chen,  et  al.  requires 
5/V  j y 

-  -f  2log2p  -  5  steps  [CHE78]  and  when  ~r  is  an  integer,  Gajski’s  SIMD 

P  p* 


1 V 

algorithm  requires  — ~(5p—  3)  —  2  steps  [GAJ81],  whereas  when 


TV- 1  . 

— 5 is 

p2-l 


integer,  our  SIMD  algorithm  requires  only  — 1 - L  steps.  For  example,  when 

P+1 

TV  =  218  and  p  =  23,  the  number  of  steps  required  by  the  SIMD  algorithms 
presented  in  [CHE78],  [GAJ81]  and  this  paper  are  163,841,  151,550  and  145,635, 
respectively. 

Finally,  when  a,  =  1,  for  all  i  in  [1,2,... ,TV],  T?<TV,1>  is  a  particular  form 
of  the  parallel  prefix  problem  and  reduces  to  computing  the  cascade  sums 
(6j+62),  (61+62+63),  ...,  (6j+62+  •  ■  •  +bN)  in  parallel  using  p  processors. 

The  following  corollary  is  a  direct  consequence  of  Theorem  1. 

TV— 1 

Corollary  1:  Given  TV  and  p  such  that  — - -  is  an  integer,  the  number  of 

p  -1 

steps  required  to  solve  the  parallel  prefix  problem  using  a  SIMD  parallel  com¬ 


puter  with  p  processors  is  — - '-. 

P+1 

TV— 1 

Thus,  when  — -  is  an  integer,  our  SIMD  algorithm  precisely  achieves 

p--l 


pS5 


the  parallel  prefix  computational  lower  bound  - L  established  by  Snir 

p+1 

[SNI86].  This  result  improves  upon  existing  approaches  to  solving  the  parallel 

prefix  problem  when  N  >  p2.  In  that  case  the  parallel  prefix  problem  is  solved 

9/v  N 

in  — -  +  0(1)  steps  by  Snir’s  algorithm  [SNI86],  2 - h  log2p  -  2  steps  by 

p+1  p 

the  data  independent  algorithm  presented  by  Kruskal,  et  al.  [KRU85],  and 
V 

-L^-(2p— 1)  —  1  steps  by  Gajski’s  algorithm  [GAJ81}. 

p- 

III.  THE  SIMD  PARALLEL  IMPLEMENTATION 

The  abstract  model  of  SIMD  parallel  computation  presented  in  Section  II 
neglected  the  issues  of  data  organization  and  alignment  as  well  as  communica¬ 
tion  overhead,  all  of  which  are  highly  machine  dependent.  We  now  present  a 
parallel  implementation  of  the  proposed  algorithm  that  takes  these  issues  into 
account.  The  SIMD  model  of  computation  considered  in  this  section  (Figure  3) 
consists  of  p  processors  executing  the  same  operation  in  lock  step,  with  each 
processor  containing  its  own  local  storage.  The  processors  are  interconnected 
by  a  unidirectional  ring  network  in  which  processor  i  transfers  data  to  proces¬ 
sor  i+l,  i  in  [l,2,...,p  —  l]  and  processor  p  transfers  data  to  processor  1.  Furth¬ 
ermore,  we  assume  that  the  network  possesses  a  broadcasting  capability  that 
allows  any  processor  to  broadcast  data  to  all  other  processors.  The  time 
required  by  the  algorithm  will  be  determined  under  the  following  assumptions: 

Al.  Each  arithmetic  operation  (addition  or  multiplication)  is  performed  in  unit 
time,  referred  to  as  a  step. 

A2.  Interprocessor  transfers  require  one  step. 

A3.  Data  broadcasts  require  one  step. 


Vi 

ft 

$ 

V 

& 


A4.  Each  a,  and  6,-  required  by  a  processor  is  assumed  to  initially  reside  in  the 
local  memory  of  that  processor. 

In  order  to  determine  an  efficient  processor  assignment,  we  first  make  the 
observation  that  the  p  consecutive  partial  solutions  updated  at  each  iteration 
of  the  update  phase  of  the  algorithm  must  reside  in  a  different  processor. 
Furthermore,  both  the  coefficient  computation  phase  and  the  partial  solution 
phase  of  the  algorithm  exhibit  explicit  data  dependencies  which  must  be 
preserved.  These  constraints  can  be  satisfied  if  we  rotate  the  processor  assign¬ 
ment  at  each  parallel  iteration  of  the  algorithm,  and  in  that  case,  the  algorithm 
can  be  directly  mapped  to  a  SIMD  p  processor  unidirectional  ring  network  with 
broadcasting  capability.  Figure  4  illustrates  such  a  processor  assignment  and 
the  corresponding  communication  requirements  for  the  case  N  =  17  and  p  —  3. 

We  now  present  the  algorithm  to  solve  R  <iV,l>  as  executed  by  processor 

k,  for  all  k  in  [1,2 

l.  PROCEDURE  Rk{N,p,a,b) 

2.  z,  :=  6, 

3.  FOR  <j  :=  0  TO  (TV— l)/(p2— 1)  -  1  DO  /*  Solve  each  R<n,l>  */ 

4.  A  [«,»]  :=  a,  ;  /*  Begin  Coefficient  Computation  Phase  */ 

r * =  i  +  (k-i)p  +  u(P2- 1)  v 

5.  FOR  «'  :=  1  TO  (p-1)  DO 

6.  A [«'+;',/]  :=  ol+y  A[i+j-\,j\-, 

/*  j  -  (l+(*-t'-l)p)  mod  p 2  +  w(p2-l)  */ 


7.  END  FOR 


8.  IF  k  =  1  THEN  *,•  :=  z,-  ELSE  z,-  :=  6<; 

/*«  =  !  +  (k-l)p  +  w(p2-l)  */ 

9.  FOR  «  :=  1  TO  (p-1)  DO 


/*  End  Coefficient  Computation  Phase  */ 
/*  Begin  Partial  Solution  Phase  * / 


10. 
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*•+/  :=  ai+i  *•'+/- 1  +  h+jl 
I*  j  =  (l+(*-«-l)p)  mod  p2  +  u(p2- 1)  */ 

11.  END  FOR  /*  End  Partial  Solution  Phase  * / 

12.  FOR  m  :=  1  TO  (p—  1)  DO  /*  Begin  Solution  Update  Phase  */ 

13.  *,•+/  :=  A  [.■+,•,*]  x,.,  +  x, 

/*  «  =  l+mp+w(p2-l),  j  -  (p  —  m+k—1)  mod  p  */ 

14.  END  FOR  /*  End  Solution  Update  Phase  */ 

15.  END  FOR 

16.  END  PROCEDURE 


Our  model  assumptions  imply  that  we  must  consider  interprocessor  com¬ 
munication  in  addition  to  operational  count  when  determining  the  number  of 
steps  required  by  the  algorithm.  Therefore,  we  must  examine  the  computations 
and  interprocessor  transfers  performed  in  loops  5,  9  and  12.  Each  iteration  of 
loop  5  requires  an  interprocessor  transfer  of  A  [t+j—  l,j]  followed  by  a  single 
multiplication.  Thus,  loop  5  requires  2(p— 1)  steps.  Loop  9  is  iterated  p—  1 
times,  each  iteration  requiring  an  interprocessor  transfer  of  xi+j_i  followed  by  a 
multiplication  and  addition.  Thus,  loop  9  requires  3(p— 1)  steps.  Loop  12  is 
also  iterated  p— 1  times,  each  iteration  requiring  a  data  broadcast  of  x,_!  fol¬ 
lowed  by  a  multiplication  and  addition.  Thus,  loop  12  also  requires  3(p— 1) 
steps.  The  total  number  of  steps  required  to  solve  each  reduced  recurrence 
R<p2, 1>  using  p  processors  is  therefore  8(p— 1),  and  hence,  the  resulting 
theorem  follows. 

yV— 1 

Theorem  S:  Given  N  and  p  such  that  — - - is  an  integer,  the  number  of 

P  -1 


Among  the  existing  SIMD  algorithms  to  solve  R<N,  1>,  the  SIMD  algo¬ 
rithm  presented  by  Gajski  [GAJ81]  can  be  most  efficiently  mapped  to  a  uni¬ 
directional  ring  network  with  broadcasting  capability.  Based  upon  the  assump- 

N 

tions  made  in  this  section,  when  — —  is  an  integer  the  number  of  steps 

P 

N 

required  by  Gajski’s  approach  to  solve  R<N,1>  is  — r(8p— 5)  —  3,  and  there- 

P 

fore  when  N  >  p2  our  approach  is  more  efficient  than  Gajski’s  when  imple¬ 
mented  upon  a  ring  network  capable  of  broadcasting. 

Finally,  we  make  the  observation  that  the  algorithm  does  not  require  all  of 
the  inputs  a,-  and  b,  in  order  for  the  processing  to  begin.  Specifically,  the  algo¬ 
rithm  requires  p2— 1  a,  and  p2  b,  for  every  5(p— 1)  computational  steps, 
corresponding  to  solving  each  R  <p2,l>  in  sequence.  Similarly,  the  outputs  xt- 
are  produced  in  blocks  of  p2— 1  at  a  time.  This  suggests  that  I/O  could  be 
overlapped  with  the  computation,  providing  a  balance  between  I/O  and  process¬ 
ing  loads,  and  therefore  the  deletion  of  assumption  A4  has  a  negligible  effect  if 
one  assumes  that  I/O  and  processing  can  be  done  concurrently. 

IV.  THE  MIMD  ALGORITHM 

In  this  section  we  again  consider  the  simplistic  model  of  computation  given 
in  Section  II  with  the  exception  that  we  no  longer  require  all  processors  to  exe¬ 
cute  the  same  operation  at  each  step;  that  is,  we  now  consider  a  MIMD  imple¬ 
mentation  in  which  we  neglect  the  issues  of  data  organization  and  alignment  as 
well  as  communication  overhead. 

The  MIMD  approach  for  solving  R<N,  1>  is  based  upon  the  observation 

that  only  p— 1  processors  are  needed  at  each  iteration  of  the  coefficient  compu- 

N _ l 

tation  phase.  Assuming  — - - to  be  an  integer,  the  total  number  of  multipli- 


cations  required  to  compute  all  necessary  coefficients  is  J ^ - L,  p—  1  of 

P  +1 

which  may  be  performed  concurrently  at  each  step.  Therefore,  all  of  the 
required  coefficients  can  be  computed  in  steps  using  p — 1  processors. 

This  leaves  one  processor  free  for  (JV— l)/(p+l)  steps,  allowing  us  to  expand  the 

•  _  -  f  it _  _  1 _  i  i  ]  N~  1  |  ,  .  .  a 


size  of  the  recurrence  by  at  most  n0  = 


2(P+1) 


and  use  the  free  processor 


to  concurrently  solve  the  reduced  system  /2<n0,l>.  Thus,  using  an  MIMD 
approach  we  can  solve  the  entire  system  i?  <Ar-fn0,l>  in  — * - L  steps. 


Given  a  recurrence  system  of  size  N  and  the  number  of  processors  p  the 
following  lemma  expresses  n0  in  terms  of  N  and  p  only. 


Lemma  1:  Given  N  and  p,  n0  = 


2  p  +3 


Given  N  and  p,  our  MIMD  approach  to  solving  R<N,  1>  consists  of  parti- 

FiV— n0- 1 1 

•  n  nr  <  n.  •  i  f»l  u  I  .  4  i  i 


tioning  R<N,1>  into  a  sequence  of 


p2-l 


+  1  reduced  recurrences. 


The  first  recurrence  is  of  size  n0+l  and  all  others  are  of  size  p2,  except  for  the 
last  recurrence  that  may  be  of  size  less  than  p2.  The  coefficient  computation 
phase  of  the  algorithm  uses  p— 1  processors  to  compute  all  needed  coefficients 
for  all  of  the  reduced  systems.  Concurrent  with  this  computation,  the  free  pro¬ 
cessor  computes  the  solution  to  /?<n0+l,I>.  Each  subsequent  R<n,  1>  is 
then  solved  in  the  same  manner  as  in  the  SIMD  algorithm  by  executing  a  par¬ 
tial  solution  phase  followed  W  a  solution  update  phase.  The  complete  solution 

\N-n0-l  1 

is  obtained  after  solving  all  - - -  -1-  1  reduced  recurrences. 


p2-l 


We  now  present  the  MIMD  algorithm  to  solve  i?<iV,l>  when 


N—Uq— 1 
P2-l 


is  an  integer.  As  in  the  SIMD  case,  it  is  not  difficult  to  modify  the  algorithm  if 


the  above  assumption  is  not  satisfied  by  simply  terminating  the  computation  at 
the  point  when  the  last  i,  is  updated.  For  win  [0,1  — n0— l)/(p2— 1)J,  we 

now  define  the  index  sets  Uu  and  Fw  as 


=  {l  +  n0  +  mp  +  w(p2— l)  :  m  =  l,2,...,p— 1} 

and 


Vw  =  {l  +  n0  4-  mp  +  w(p2- 

— 1)  :  m  =  0,1,... ,p  — l}. 

1. 

PROCEDURE  R (N,p,a,b) 

2. 

z,  :=  6j 

3. 

FOR  «  :=  2  to  n0+l  DO 

/*  Solve  i?<n0,l>  */ 

4. 

1  + 

5. 

END  FOR 

/*  End  R<n0ll>  Solution  */ 

6. 

FOR  w  :=  0  TO  (N-n0-l)/(p2-l)  -  1  DO 

/*  Begin  Coefficient  Computation  Phase  */ 

7. 

FORALL  i  t  Uu  DO  IN  PARALLEL 

8. 

A[i\i]  :=  a,-; 

9. 

END  FORALL 

10. 

FOR  «  :=  1  TO  (p-l)DO 

11. 

FORALL  j  f  Uu  DO  IN  PARALLEL 

12. 

A[i+j,}\  :=  ol+y  A[i+j-l,j]; 

13. 

END  FORALL 

14. 

END  FOR 

15. 

END  FOR 

/*  End  Coefficient  Computation  Phase  */ 

16. 

FOR  w  0  TO  (N-n0-l)/(p2-l)  -  1  DO 

/*  Solve  each  /?<n,l>  */ 

17. 

FORALL  •  t  U„  DO  IN  PARALLEL 

/*  Begin  Partial  Solution  Phase  */ 

21. 


FORALL  j  e  Vu  DO  IN  PARALLEL 


22. 

xi+i  ai+i  xi+j-l  +  b<+}  '< 

23. 

END  FORALL 

24. 

END  FOR 

/*  End  Partial  Solution  Phase  */ 

25. 

FOR  .•  £  Uu  DO 

/*  Begin  Solution  Update  Phase  */ 

26. 

FORALL  j  :=  0  TO  (p-1)  DO  IN  PARALLEL 

27. 

*i+i  A  [«+/,»]  x._,  +  xi+j; 

28. 

END  FORALL 

29. 

END  FOR 

30. 

END  FOR 

/*  End  Solution  Update  Phase  */ 

31. 

END  PROCEDURE 

Note  that  (i)  the  coefficient  computation  phase  of  the  SIMD  algorithm  has 
been  modified  so  as  to  compute  the  necessary  coefficients  for  all  R<n,  1>  before 
the  first  reduced  recurrence  is  solved  in  parallel;  and  (ii)  the  processor  that  is 
idle  during  the  SIMD  coefficient  computation  phase  is  now  used  to  concurrently 
compute  the  solution  to  /Z<n0-H,l>.  An  example  illustrating  the  computa¬ 
tions  performed  by  the  MIMD  algorithm  is  given  in  Figure  5  for  the  case  N  — 

19  and  p  =  3. 

Based  upon  the  MIMD  model  considered  in  this  section,  we  conclude  that 
the  time  required  by  the  MIMD  algorithm  is  determined  by  the  computational 
operations  performed  in  loops  3,  8  and  16.  Loops  3  and  6  are  executed  con¬ 
currently,  using  1  and  p— 1  processors,  respectively.  Loop  6  requires 
N— n0— 1 

- steps,  and  the  quantity  n0  has  been  defined  so  that  loop  3  requires 

P+1 

at  most  the  same  number  of  steps  as  loop  6.  All  p  processors  are  used  in  exe- 


4(iV— n0— 1) 

cuting  loop  16,  and  thus  loop  16  requires  - steps.  The  number  of 

p+1 

5(N— n0— 1) 

steps  required  by  the  MIMD  algorithm  is  therefore  - steps.  Thus, 

p+1 

the  resulting  theorem  follows. 

Theorem  S:  Given  N  and  p  such  that  N  >  p2  +  &p  -  1,  the  number  of  steps 

required  to  solve  a  linear  recurrence  system  R<N,  1>  using  a  MIMD  parallel 

.  [  N-Vz  1  , 


computer  with  p  processors  is  at  most 


(p+3/2)(p-l) 


5(p  -1). 


When  N  —  p2  +  Vzp  -  1,  our  approach  reduces  to  the  MIMD  algorithm 
presented  in  [GAJ81]  in  which  the  number  of  steps  required  to  solve  R<N,1>  is 
at  most  — r~ — - b(p— 1).  When  N  >  p2  +  Vzp  -  1,  our  MIMD  approach 


pJ+Vs>p—  2 


differs  from  [GAJ81]  by  organizing  a  single  coefficient  computation  phase  to 
compute  the  necessary  coefficients  for  all  R<n,  1>  before  the  first  reduced 
recurrence  is  solved  in  parallel,  rather  than  including  a  coefficient  computation 
phase  as  part  of  solving  each  reduced  recurrence. 

Finally,  we  note  that,  like  the  SIMD  algorithm,  the  MIMD  algorithm  can 
also  be  directly  mapped  to  a  p  processor  unidirectional  ring  network  with 
broadcasting  capability.  Figure  6  illustrates  such  a  processor  assignment  and 
the  corresponding  communication  requirements  for  the  case  N  =  19  and  p  =  3. 

V.  CONCLUSIONS 

The  algorithm  presented  in  this  paper  exploits  the  fact  that  for  a  fixed 

number  of  processors  p,  the  parallel  approach  presented  in  [MEY85]  to  solve 

R<N,  1>  attains  maximum  speedup  —(p+1)  when  N  =  p2.  When  N  >  p2, 

5 

the  structure  of  R<N,  1>  allows  the  solution  to  be  obtained  by  sequentially 
solving  a  series  of  reduced  recurrences,  each  of  size  p2,  except  for  the  last 
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Parallel  Memory 


Computational 

Phase 

Time 

SteD 

Parallel  Computations 

Coefficient 

— 

A[4,4]  =  a4 

A[7 ,7  i  =  a7 

Computation 

Phase 

1 

— 

A[5,4]  =  a6A[4,4] 

A[8,7]  =  agA[7 ,7] 

2 

— 

A[6,4j  =  a»A[5,4] 

A[9,7]  =  a#A[8,7] 

Partial 

*i  =  *i 

*4  =  64 

II 

Solution 

Phase 

3,4 

=  4"  ^2 

*6  =  <*6*4  +  &6 

*8  =  <*8*7  +  bg 

5,6 

*3  =  a3*2  +  ^3 

*8  =  <*#*6  +  bg 

=  flgXg  +  bg 

Solution 

7,8 

*4  =  A[4,4]i3  +  *4 

*6  =  A[5,4]zs  +  zb 

*8  =  A[6,4]jc3  +  *„ 

Update 

Phase 

9,10 

*7  =  A(7,7]ij  +  *7 

*8  =  A[8,7]jtj  4-  ze 

*8  =  A[9,7]i#  +  x„ 

Coefficient 

— 

A[12, 1 2]  =  b12 

A[l5,15]  =  a16 

Computation 

Phase 

11 

— 

A[13,12]  =  a13 A[12,12] 

A[l6,15]  =  a  10A[l  5, 1 5] 

12 

— 

A[14,12]  =  a14A[l3,12] 

A[17,15)  =  ar\[l6,15j 

Partial 

*8  =  *9 

*12  =  ^12 

*16  =  ^  16 

Solution 

Phase 

13,14 

*10  =  <*  10*9  +  ^10 

*13  =  <*13*12  +  ^  13 

*18  =  <*18*16  +  ^  18 

15,16 

*11  =  <*11*10  +  bu 

*14  =  <*14*13  +  ^  14 

*17  =  <*17*18  +  ^  17 

Solution 

17,18 

*1 2  -  A[12,12)*„  +  *12 

*13  =  A[l3,12]iu  +  *13 

*14  =  A[14,12]j:Ii  +  *u 

Update 

Phase 


19,20 


*16  —  A[l5,15]i|4  +  *16 


*ic  —  A[l6,15]jti4  +  zlt 


*17  —  A[17,15]j:14  +  z17 
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Computational 

ipsi 

Parallel  Computations 

Phase 

PROCESSOR  #1 

PROCESSOR  #2 

PROCESSOR  #3 

Processor 

Resident 

a.,«  -  6,8,14,16 

a,,  i  -  2,4,9,10,12,17 

a,,  1  -  3,5,7, 1 1 , 13, 1 5 

Data 

6, ,  i  —  1,6,8,14,16 

6, ,  «  —  2,4,9,10,12,17 

b,,  i  -  3,5,7, 1113,15 

Coefficient 

1 

— 

A[4,4)  —  a4  — ► 

A[7,7]  -  a7  — 

Computation 

2,3 

A[8,7]  -  agA(7,7]  — 

— 

A(5 ,4)  -  a&A(4,4]  — 

Phase 

A 

A[6.4]  -  a»A{S,4| 

A[9,7]  -  a#A(8,7] 

— 

Partial 

5 

-£i  -  G  -* 

x<-  64  —» 

X7  “  67  - *■ 

Solution 

6,7.8 

*8  ™  ^  8  — * 

Z2  ™  ^ 2 — *■ 

Phase 

9,10, 1 1 

<J8J:S  +  G 

Zg™  fl 0^8  4*  6  g 

Z3  “  ^3^2  4"  ^  3 

Solution 

12,13,14 

Xb  -  A[6,4jxs  +  z9 

X4  -  A(4,4]x3  +  z4 

Xs  -  A[5,4]x3  +  z6 

Update 

Phase 

15,16 

Xg  -  A(8,7]x«  +  z8 

Xg  -  A[9,7]xe  +  Zg 

X7  “  A[7,7]xe  +  Z7 

Coefficient 

17 

A[  12,12)  ™  a  12 

A[  1 5, 1 5)  -  a  15  — ► 

Computation 

18,19 

A(l6,15)  -  a1#A(l5,15|  — 

Xg-e 

A[  13, 12]  -  a13A(l2,12]  — 

Phase 

20 

A[  14, 12)  -  a14A|l3,12| 

A[  17, 15)  -  o1TA[16,15] 

Xg-e 

Partial 

21 

is  — 

z  12  “  b  12— * 

z  IS  ~  b  13  — » 

Solution 

22,23,24 

Z  18  “  a  itX  16  +  6  18  — *■ 

Xio”  “ioZs  +  bio—* 

z  13  —  fl  13Z  12  +  4  13  ► 

Phase 

25,26,27 

Zh-  a  mZis  +  b  H 

z  17  “  0  17Z  IB  +  4  17 

Xu  ”  °uXio  +  b  u  x* 

Solution 

28.29,30 

X is  “  Ajl4, 12]xu  +  z u  — * 

X12  “  A[l2,12)xu  +  z12 

Xis-  A|l3,12]xn  +  z i3 

Update 

Phase 

31,32 

Xie  -  A|16,15|xh  +  z„ 

X 17  -  A[  17, 15]x  14  +  Z  ,7 

Xis  -  Ajl5,15)xi4  +  Zis 

Processor 

Resident 

Results 

z,,  i  -  1,6,8,14,16 

BBi 

OB 

Notation:  — *  indicates  that  the  data  evaluated  at  the  current  step  is  transferred  to  the  processor  on  the  right. 

— i  indicates  that  the  data  evaluated  at  the  current  step  is  broadcast  to  all  processors. 


FIGURE  4  :  Processor  Assignment  for  SIMD  Solution  of  fl<i7,i>  with  p  =  3 


Computational 
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Parallel  Computations 


Time 


Coefficient 

*1  =  6, 

A[6,6]  =  a, 

A[9,9]  =  a9 

Computation 

+ 

1 

X<i  ~  (Zo£  j 

A[7,6j  =  a7A[6,6] 

A[10,9]  =  a10A[9,9] 

R<n0+ 1,1> 

2 

2%  ~  *2  +  ^2 

A[8,6]  =  a8A[7,6] 

A[l  1 ,9]  =  a  1 1  A[l0,9] 

Solution 

Phase 

— 

A[14,14]  =  a14 

A[l7 ,17]  =  o,7 

3 

x3  =  a3*2 

A[15,14]  =  al6A[l4,14] 

A[18,17]  =  o18A[17,17] 

4 

*3  =  *3  +  b  3 

A[16,14]  =  a1#A[l5,14] 

A[l9,17]  =  a18A[l8,17] 

Partial 

* 3  ~  *3 

*8  =  fce 

*»  =  ^9 

Solution 

Phase 

5,6 

24  =  <1423  +  64 

x7  =  a7xB  +  b7 

*10  =  a10*9  +  ^10 

7,8 

*6  =  a&4  +  b6 

XS  =  °8*7  +  ^8 

*11  =  “11*10  +  ^11 

Solution 

9,10 

*8  -  A[6,6]26  +  xi 

*7  =  A  [7  ,6]zb  +  x7 

*8  =  A[8,6]26  +  Xg 

Update 

Phase 

11,12 

2g  =  A[9,9]28  +  X# 

2W  =  A[l0,9]i8  +  x,0 

*11  =  A[ll,9]i8  +  x„ 

Partial 

*11  =  *11 

II 

O' 

X 

H 

II 

o- 

Solution 

Phase 

13,14 

*12  =  fl12*ll  +  ^12 

*16  ~  a16*14  +  ^  16 

*18  —  18*17  +  &18 

15,16 

*13  =  a  13*12  +  ^  13 

*18  =  ®  18*16  +  ^18 

*19  =  “  19*18  +  ^  19 

Solution 

17,18 

•*14  =  A[l4,14]213  +  X|4 

*16  —  A[15,14]2j3  -f  X,g 

*18  =  A[16,14]2j3  +  x,e 

Update 

Phase 

19,20 

*17  ”  A[l7,17ji1(  +  x17 

*18  ™  A[l8,17]21B  +  X,g 

*19  ~  A[19,17]jt,8  +  x18 

FIGURE  5  :  Solution  of  /?<19,1>  for  3  processor  MIMD  computer 
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PROCESSOR  #1 


a, ,  i  -  8,10,16,18 

b, ,  i  -  1,8,10,16,18 


Parallel  Computations 


PROCESSOR  #2 


-  2,3, 4, 6, 11, 12, 14, 19 


-  4,6,11,12,14,19 


PROCESSOR  #3 


a.,  I  -  5,7,9  13,15,17 

A, ,  i  —  2,3  5,7  9. 13. 15, 17 


11  Z.3  — 

12,13,14  z io  ™  4 io  ~ * 

15,16,17  Zg  *  0^7  + 


18,19,20  z*  -  A[8,6|z8  +  2g  ^ 
21,22  Xio-  A(l0,9|zg  +  *io 


2 8  ”  4g  — * 

Z4-  0^3+  6  4  -► 
2  ii  “  anz10+  b  u 


Zs  -  A[6,6jz6  +  2 3 
Zn  -  A[ll,9)x8+  2 ,i 


Z7  -  A[7,6]xs  +  2 7 
ZB  -  A(9,9]zg  +  2 b 


23 

24,25,26 


2  ig  —  0  igT  17  +  b  if  ■ 


27,28,29  2ig  —  a  i3Zi8  +  4 13 


2u  “  4  14  — » 

Zi2—  Oiaiu  +  4i2- 
2  IB  —  0  1B2  18  +  4  IB 


* 17  —  4  17  — *■ 

1 16  “  0  (52  14  +  4  15  ■ 
Z 13  —  a  131 12  +  4  13  - 


30,31,32  Zi3  ™  A[16,14]zi3  +  2ib  Z14  A(14,14]zi3  +  214  Z15  “  A[15,14]zi3  +  215 

33,34  Z18  ™  A[l8,17]zie  +  Zjg  Zib  *  A[  19, 17]z  ge  4-  2ib  Z17  *  A[l7,17]zie  +  217 


-  1,8,10,16,18 


-  4,6,11,12,14,19 


Notation:  — *  indicates  that  the  data  evaluated  at  the  current  step  is  transferred  to  the  processor  on  the  right, 

indicates  that  the  data  evaluated  at  the  current  step  is  broadcast  to  all  processors. 

FIGURE  6  :  Processor  Assignment  for  MIMD  Solution  of  R< i9,i>  with  p  —  3 
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